In the 21st century, people started travelling a lot, either for
business, leisure, or adventure, with the number of passengers served by
flights increasing exponentially. For the past 2 years, we were
restricted from traveling due to the Pandemic and now restrictions have
been eased and the demand to travel has gone back up.
The airlines deploy multiple machine learning models to update the
flight ticket price numerous times over a day. They utilize hundreds of
features available at their disposal, including all the flight details
like departure and arrival city, time, date, user traffic on the day of
booking, previous year’s demand over a specific period, etc. This led to
many travel hacks claiming to reduce ticket prices for passengers.
Here are the list of SMART Questions I am trying to answer:
The data has been downloaded from kaggle.com link to which has been provided in the reference section.
This dataset contains information about flight booking options from the website “Ease my trip” for flight travel between India’s top 6 metro cities. The data was collected for 50 days, from February 11th to March 31st, 2022.
It has 300,261 observations and 9 variables of the following:
Date : Date of departure
Dep_time : Departure time
From : Departing city
Arr_time : Arrival time
To : Arrival city
Airline : The name of the airline carrier for the journey.
Class : Business or economy seat.
Time_Taken : Flight duration
Stop : Number of stops between departure and arrival city.
Price : The cost of the ticket.
As with any data, the data I acquired needed to be cleaned and new features have been added to aid the analysis down the line.
The preprocessing :
In addition to the features present in the data, feature engineernig generated these new set of features
flight_code : flight code of a particular journey from source to destination
dep_time_cat : Departure time in category (Morning, Afternoon….)
arr_time_cat : Arrival time in category (Afternoon, Evening, Night….)
month : Month of the departure (February, March….)
day : Day of scheduled departure (Monday, Tuesday….)
Weekend: Is the departure day a weekend or not.
time_taken_hours : Duration of flight in hours
time_taken_minutes : Duration of flight in minutes
from_location : Geographical co-ordinates of departure city (Latitude, Longitude)
to_location : Geographical co-ordinates of arrival city (Latitude, Longitude)
distance : Distance of journey in miles
# # functions to preprocess
# #creating buckets for different departure times
# time_category <- function(x){
# time <- as.integer(strsplit(x,":")[[1]][1])
# if(time <= 5) {return("mid night")}
# else if(time >5 & time <=10) {return ("morning")}
# else if(time >10 & time <=15) {return ("afternoon")}
# else if(time >15 & time <=20) {return("evening")}
# else if(time >20){return("night")}
# }
#
#
# # co-ordinates of the city
# location <- function(city){
# if(city == "Delhi") {return("28.7041, 77.1025")}
# else if(city == "Mumbai") {return("19.0760, 72.8777")}
# else if(city == "Bangalore") {return("12.9716, 77.5947")}
# else if(city == "Kolkata") {return("22.5726, 88.3639")}
# else if(city == "Hyderabad") {return("17.3850, 78.4867")}
# else if(city == "Chennai") {return("13.0827, 80.2707")}
# }
#
#
# # Finding distance between two co-orinates
# find_distance <- function(x){
# from_location <- x["from_location"]
# to_location <- x["to_location"]
#
# from_lat <- as.double(strsplit(from_location,", ")[[1]][1])
# from_long <- as.double(strsplit(from_location,", ")[[1]][2])
# to_lat <- as.double(strsplit(to_location,", ")[[1]][1])
# to_long <- as.double(strsplit(to_location,", ")[[1]][2])
#
# distance <-round((distm(c(from_long,from_lat),c(to_long,to_lat), fun = distHaversine)[1]/1000),digits = 2)
# return(distance)
# }
#
#
# # Reading the data
# data <- read.csv("C:/Users/LEGION/OneDrive/Desktop/GW/sem_1/intro to Data Science/midterm/data/airline.csv")
#
#
# # Feature engineering new columns for the Month, Day, and weekend
# data$month <- months(as.Date(data$date))
# data$day <- weekdays(as.Date(data$date))
# data$weekend <- ifelse(data$day == "Sunday" | data$day == "Saturday", 1, 0)
#
# # Cleaning the feature price
# data$price <- gsub(",", "", data$price)
# data$price <- as.integer(data$price/80)
#
# # combining airline code and flight number to make the flight code
# data$flight_code <- paste(data$ch_code, "-", data$num_code)
#
#
# # removing unwanted columns
# data_1 <- subset(data, select = -c(ch_code, num_code))
# data_1$time_taken_hms <- hm(data_1$time_taken)
#
# # Engineering two new features, duration of flight in hours and in minutes
# data_1$time_taken_minutes <- hour(data_1$time_taken_hms)*60 + minute(data_1$time_taken_hms)
# data_1$time_taken_hours <- round(data_1$time_taken_minutes/60,2)
#
# data_1 <- subset(data_1, select = -c(time_taken))
#
#
#
# # Categorising the departure and arrrival times
#
# data_1["dep_time_cat"] <- apply(X = data_1["dep_time"], FUN = time_category, MARGIN = 1)
# data_1["arr_time_cat"] <- apply(X = data_1["arr_time"], FUN = time_category, MARGIN = 1)
#
#
# dep_time_1 <- as_hms(strptime(data_1$dep_time,format = "%H:%M"))
# data_1$dep_time <- dep_time_1
#
#
# arr_time_1 <- as_hms(strptime(data_1$arr_time,format = "%H:%M"))
# data_1$arr_time <- arr_time_1
#
# # Adding from and to co-ordinates to the cities
#
# data_1["from_location"] <- apply(X = data_1["from"], FUN = location, MARGIN = 1)
# data_1["to_location"] <- apply(X = data_1["to"], FUN = location, MARGIN = 1)
#
#
# # Adding new feature "distance" between departure and arrival city
# data_1['distance'] <- apply(X = data_1,FUN = find_distance, MARGIN = 1)
#
# # Rearranging the columns in correct order
# data_1 <- data_1[, c("airline","flight_code","date","from","dep_time","dep_time_cat","to","arr_time","arr_time_cat",
# "stop","class","price","month","day","weekend","time_taken_hms","time_taken_hours",
# "time_taken_minutes", "from_location","to_location","distance")]
#
#
# # Saving the dataframe into a .csv file
# # write.csv(df_1, "C:/Users/LEGION/OneDrive/Desktop/GW/sem_1/intro to Data Science/midterm/data/final_data.csv", row.names=FALSE)
df <- read.csv("../data/final_data.csv")
df$date <- as.Date(df$date, format = "%Y-%m-%d")
factor_cols <- c("airline","flight_code","from","to","dep_time_cat","arr_time_cat",
"stop","class","month")
for (col in factor_cols){
df[,col] <- as.factor(df[,col])
}
time_cols <- c("dep_time","arr_time")
for (col in time_cols){
df[,col] <- chron(times = df[,col])
}
df$day <- factor(df$day, levels= c("Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))
df$airline <- factor(df$airline, levels= c("Indigo","Air India", "GO FIRST", "Vistara", "AirAsia", "SpiceJet","StarAir","Trujet"))
xkabledplyhead(df, title = "Flight data")
| airline | flight_code | date | from | dep_time | dep_time_cat | to | arr_time | arr_time_cat | stop | class | price | month | day | weekend | time_taken_hms | time_taken_hours | time_taken_minutes | from_location | to_location | distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Air India | AI - 868 | 2022-02-11 | Delhi | 18:00:00 | evening | Mumbai | 20:00:00 | evening | 0 | business | 320 | February | Friday | 0 | 2H 0M 0S | 2.00 | 120 | 28.7041, 77.1025 | 19.0760, 72.8777 | 718 |
| Air India | AI - 624 | 2022-02-11 | Delhi | 19:00:00 | evening | Mumbai | 21:15:00 | night | 0 | business | 320 | February | Friday | 0 | 2H 15M 0S | 2.25 | 135 | 28.7041, 77.1025 | 19.0760, 72.8777 | 718 |
| Air India | AI - 531 | 2022-02-11 | Delhi | 20:00:00 | evening | Mumbai | 20:45:00 | evening | 1 | business | 528 | February | Friday | 0 | 24H 45M 0S | 24.75 | 1485 | 28.7041, 77.1025 | 19.0760, 72.8777 | 718 |
| Air India | AI - 839 | 2022-02-11 | Delhi | 21:25:00 | night | Mumbai | 23:55:00 | night | 1 | business | 556 | February | Friday | 0 | 26H 30M 0S | 26.50 | 1590 | 28.7041, 77.1025 | 19.0760, 72.8777 | 718 |
| Air India | AI - 544 | 2022-02-11 | Delhi | 17:15:00 | evening | Mumbai | 23:55:00 | night | 1 | business | 584 | February | Friday | 0 | 6H 40M 0S | 6.67 | 400 | 28.7041, 77.1025 | 19.0760, 72.8777 | 718 |
A high level overview of the data I am dealing with:
The Departure time ranges from 00:10 AM in the Morning to 23:55 PM in the Night.
And the Arrival time ranges from 00:05 AM in the Morning to 23:59 PM in the Night.
There are 3 possible values for stops i.e 0(nonstop), 1, 2+ stops.
There are two classes of tickets Business and Economy.
Is the day of departure a weekend or not (has values 0 or 1)
Time taken for the journey ranging from 50 minutes to 49.8 Hours
The distance covered by the flight with a minimum of 181 to a maximum of 1100 miles.
#removing duplicated rows
df_1 <- distinct(df, airline,flight_code,date,from,dep_time,to,arr_time,class, .keep_all= TRUE)
#mapping the major cities of india
Delhi <- c(77.1025,28.7041)
Mumbai <- c(72.8777,19.0760)
Bangalore <- c(77.5947,12.9716)
Kolkata <- c(88.3639,22.5726)
Hyderabad <- c(78.4867,17.3850)
Chennai <- c(80.2707,13.0827)
# Data frame
data <- rbind(Delhi, Mumbai,Bangalore, Kolkata, Hyderabad, Chennai) %>%
as.data.frame()
colnames(data) <- c("long","lat")
# World map
map('world',
col="#d5d6d8", fill=TRUE, bg="white", lwd=0.05,
mar=rep(0,4),border=0,xlim = c(68,98),ylim=c(8,37) )
points(x=data$long, y=data$lat, col="black", cex=1.5, pch=20)
# Generate all pairs of coordinates
all_pairs <- cbind(t(combn(data$long, 2)), t(combn(data$lat, 2))) %>% as.data.frame()
colnames(all_pairs) <- c("long1","long2","lat1","lat2")
# add every connections:
for(i in 1:nrow(all_pairs)){
plot_my_connection(all_pairs$long1[i], all_pairs$lat1[i], all_pairs$long2[i],
all_pairs$lat2[i], col="black", lwd=1)
}
# add points and names of cities
text(rownames(data), x=data$long, y=data$lat, col="black", cex=0.5, pos=4)
I am trying to know the factors affecting the flight prices in these major cities of India.
Delhi : The Capital of India
Mumbai: The Financial capital of India
Kolkata: Biggest economy after Mumbai and Delhi
Bangalore: Known as silicon valley of India
Hyderabad: The city of Nizam’s and booming Tech Industry
Chennai: Known as the Detroit of Asia, for having several automobile manufacturing companies
xform <- list(categoryorder = "array",
categoryarray = c("Indigo","Air India", "GO FIRST", "Vistara",
"AirAsia", "SpiceJet","StarAir","Trujet"))
fig <- plot_ly(x = air_line_cout$Var1, y = air_line_cout$Freq, type = 'bar', color = air_line_cout$Var1,
colors = c("Indigo" = "#1f77b4","Air India" = "#ff7f0f", "GO FIRST" = "#2ca02c", "Vistara" = "#9467bd",
"AirAsia" = "#d62728", "SpiceJet" = "#8c564b","StarAir" = "#e377c2","Trujet" = "#7f7f7f"))
fig <- fig %>% layout(xaxis = list(title = "Airline",xform), title = "Number of flights scheduled by airlines",
yaxis = list(title = "flights scheduled every day"), showlegend = F)
# api_create(fig, filename = "Flight traffic share of each airlines")
fig
Indigo, one of the largest airlines in India has most flight traffic of 414 Scheduled flights everyday between these 6 cities , followed by Air India with 141 flights scheduled (recently acquired by TATA, a multi national conglomerate), and budget airlines like Go First, AirAsia and SpiceJet.
With Least number of flights operated by StarAir and TruJet as they are the latest entries to the airline industry, with just couple of flights in their fleet.
# number of flights between cities per day
city_freq <- rename(count(df_for_count, from, to), Freq = n)
city_freq <- city_freq[order(-city_freq$Freq),]
city_freq$Freq <- as.integer(city_freq$Freq/49)
row.names(city_freq) <- NULL
xkabledplyhead(city_freq, title = "Daily flight frequencies")
| from | to | Freq |
|---|---|---|
| Delhi | Mumbai | 168 |
| Mumbai | Delhi | 139 |
| Bangalore | Delhi | 113 |
| Kolkata | Delhi | 88 |
| Delhi | Bangalore | 62 |
The Highest traffic is seen between the cities Delhi -> Mumbai , Mumbai -> Delhi and followed by Bangalore -> Delhi.
This Sankey chart shows the flow of flights (flight traffic) in between the cities.
SankeyDiagram(city_freq[, -3],link.color = "Source",
weights = city_freq $Freq,label.show.percentages = TRUE,
variables.share.values = TRUE)
density <- density(df$price)
economy <- df[df$class == "economy",]
density_economy <- density(economy$price)
business <- df[df$class == "business",]
density_business <- density(business$price)
fig <- plot_ly(x = ~density_economy$x, y = ~density_economy$y, type = 'scatter', mode = 'lines',
name = 'Economy', fill = 'tozeroy')
fig <- fig %>% add_trace(x = ~density_business$x, y = ~density_business$y, name = 'Business', fill = 'tozeroy')
fig <- fig %>% layout(title = "Price distribution of economy and business class",xaxis = list(title = 'Price',range=c(0,1100)),yaxis = list(title = 'Density'))
fig
It is a fact that the passenger experience in business class is far superior than the experience in economy class, it is because of the service, food and the in flight entertainment. This all costs money and it is clearly visible from the plot above that ticket prices of business class lie towards the middle and right-side of the price scale ranging from as low as $200 to as much as $1000+, where as the whole of economy class lies almost to the left of $200.
xform <- list(categoryorder = "array",
categoryarray = c("Indigo", "Air India","Vistara", "AirAsia","SpiceJet","TruJet"))
fig1 <- plot_ly(x = economy$airline, y = economy$price,type = "box", name = "economy")
fig1 <- fig1 %>% layout(title = "price range of airlines for economy and business class",
xaxis = list(title = "Airline", showline = T, linewidth = 0.5),
yaxis = list(title = "Price", showline = T, linewidth = 1,range=c(0,250)))
fig2 <- plot_ly(x = business$airline, y = business$price,type = "box", name = "business")
fig2 <- fig2 %>% layout(title = "price range of airlines for economy and business class",
xaxis = list(title = "Airline", showline = T, linewidth = 0.5),
yaxis = list(title = "price", showline = T, linewidth = 1,range=c(100,1100)))
fig2 <- fig2 %>% layout(xaxis = xform, yaxis = list(side = "right"))
fig <- subplot(fig1, fig2,shareX = T) %>% layout(title = 'Price range of business and economy tickets')
fig